Coffee Shop Sales Analysis¶
Mini data project exploring sales trends, product performance, and time patterns.
**Author:** Sabína Mikušovičová · **Tools:** Python (Pandas, Seaborn, Matplotlib)
🎯 Objective¶
The goal of this project is to explore and visualize sales data from a coffee shop in order to identify:
- Top-selling coffee types
- Sales trends over time (years, months, hours)
- Busy hours and days of the week
- Average price evolution and total revenue growth
🧰 Tools & Libraries¶
- Python: pandas, numpy, matplotlib, seaborn
- Environment: JupyterLab
- Data visualization: Heatmaps, bar charts, line plots
- Dataset: Coffee shop transaction data (anonymized)
📂 Notebook Structure¶
- Data Loading & Preparation
- Exploratory Data Analysis (EDA)
- Time-based Trends
- Product Sales Analysis
- Insights & Conclusions
Data Loading and Preparation¶
In [52]:
import pandas as pd
import datetime as dt
import numpy as np
In [53]:
df = pd.read_csv('Coffe_sales.csv')
df.head()
Out[53]:
| hour_of_day | cash_type | money | coffee_name | Time_of_Day | Weekday | Month_name | Weekdaysort | Monthsort | Date | Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | card | 38.7 | Latte | Morning | Fri | Mar | 5 | 3 | 2024-03-01 | 10:15:50.520000 |
| 1 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 12:19:22.539000 |
| 2 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 12:20:18.089000 |
| 3 | 13 | card | 28.9 | Americano | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 13:46:33.006000 |
| 4 | 13 | card | 38.7 | Latte | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 13:48:14.626000 |
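As an aside, the `Date` parsing done later could be folded into the load step with `parse_dates`. A minimal sketch, using an in-memory stand-in for `Coffe_sales.csv` (same column layout as above) so it runs without the real file:

```python
import io
import pandas as pd

# Tiny stand-in for Coffe_sales.csv (subset of the columns shown above),
# so the sketch runs without the real file
csv_text = (
    "hour_of_day,cash_type,money,coffee_name,Date\n"
    "10,card,38.7,Latte,2024-03-01\n"
    "12,card,28.9,Americano,2024-03-02\n"
)

# parse_dates converts Date during the load itself,
# avoiding a separate pd.to_datetime step later
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
```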
Exploratory Data Analysis¶
In [55]:
df.head()
Out[55]:
| hour_of_day | cash_type | money | coffee_name | Time_of_Day | Weekday | Month_name | Weekdaysort | Monthsort | Date | Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | card | 38.7 | Latte | Morning | Fri | Mar | 5 | 3 | 2024-03-01 | 10:15:50.520000 |
| 1 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 12:19:22.539000 |
| 2 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 12:20:18.089000 |
| 3 | 13 | card | 28.9 | Americano | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 13:46:33.006000 |
| 4 | 13 | card | 38.7 | Latte | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 13:48:14.626000 |
In [56]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3547 entries, 0 to 3546
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hour_of_day  3547 non-null   int64
 1   cash_type    3547 non-null   object
 2   money        3547 non-null   float64
 3   coffee_name  3547 non-null   object
 4   Time_of_Day  3547 non-null   object
 5   Weekday      3547 non-null   object
 6   Month_name   3547 non-null   object
 7   Weekdaysort  3547 non-null   int64
 8   Monthsort    3547 non-null   int64
 9   Date         3547 non-null   object
 10  Time         3547 non-null   object
dtypes: float64(1), int64(3), object(7)
memory usage: 304.9+ KB
In [57]:
df['Date'] = pd.to_datetime(df['Date'])
# Time holds time-of-day strings only; pandas attaches a filler (current) date when parsing them
df['Time'] = pd.to_datetime(df['Time'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3547 entries, 0 to 3546
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hour_of_day  3547 non-null   int64
 1   cash_type    3547 non-null   object
 2   money        3547 non-null   float64
 3   coffee_name  3547 non-null   object
 4   Time_of_Day  3547 non-null   object
 5   Weekday      3547 non-null   object
 6   Month_name   3547 non-null   object
 7   Weekdaysort  3547 non-null   int64
 8   Monthsort    3547 non-null   int64
 9   Date         3547 non-null   datetime64[ns]
 10  Time         3547 non-null   datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(3), object(5)
memory usage: 304.9+ KB
UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
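Passing an explicit `format` would silence the warning above. A minimal sketch on time strings shaped like those in the dataset:

```python
import pandas as pd

# Time-of-day strings like those in the dataset (microseconds included)
times = pd.Series(['10:15:50.520000', '12:19:22.539000'])

# An explicit format avoids the "Could not infer format" warning;
# pandas still attaches a filler date to time-only values
parsed = pd.to_datetime(times, format='%H:%M:%S.%f')

# If only the clock time matters, .dt.time drops the filler date
clock = parsed.dt.time
```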
In [58]:
df['coffee_name'].nunique()
Out[58]:
8
Product Sales Analysis¶
In [60]:
df["year"] = df['Date'].dt.year
df_sales = (
    df.pivot_table(
        index=["coffee_name", "year", 'Month_name'],
        aggfunc={'money': ['mean', 'sum']})
    .droplevel(0, axis='columns')
    # note: 'sum' aggregates money, so this column is total revenue per group rather than a unit count
    .rename({'mean': 'avg_price', 'sum': 'total_units_sold'}, axis='columns')
)
df_sales.sample(10)
Out[60]:
| avg_price | total_units_sold | |||
|---|---|---|---|---|
| coffee_name | year | Month_name | ||
| Cappuccino | 2024 | Sep | 33.178537 | 1360.32 |
| Cortado | 2024 | Mar | 28.900000 | 809.20 |
| Espresso | 2025 | Mar | 21.060000 | 210.60 |
| Americano | 2025 | Jan | 25.960000 | 649.00 |
| 2024 | Nov | 25.960000 | 649.00 | |
| Cappuccino | 2024 | Dec | 35.760000 | 1358.88 |
| May | 37.720000 | 1961.44 | ||
| Cortado | 2024 | May | 27.920000 | 474.64 |
| Cocoa | 2024 | Mar | 38.700000 | 232.20 |
| Cortado | 2025 | Jan | 25.960000 | 571.12 |
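The same per-group summary can also be written with `groupby` and named aggregation, which labels the output columns in one step. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'coffee_name': ['Latte', 'Latte', 'Espresso'],
    'money': [38.7, 35.8, 21.0],
})

# Named aggregation sets the column names directly,
# so no droplevel/rename is needed afterwards
summary = toy.groupby('coffee_name').agg(
    avg_price=('money', 'mean'),
    total_revenue=('money', 'sum'),
)
```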
In [61]:
df_total = (
    df_sales
    .assign(sales=lambda x: x['avg_price'] * x['total_units_sold']).round(2)
    .sort_values(['total_units_sold'], ascending=False)
    .reset_index()
)
df_total
Out[61]:
| coffee_name | year | Month_name | avg_price | total_units_sold | sales | |
|---|---|---|---|---|---|---|
| 0 | Latte | 2024 | Oct | 35.76 | 4291.20 | 153453.31 |
| 1 | Latte | 2024 | Sep | 33.13 | 3114.48 | 103191.34 |
| 2 | Americano | 2025 | Feb | 25.96 | 3037.32 | 78848.83 |
| 3 | Americano with Milk | 2024 | Sep | 28.17 | 2930.14 | 82555.00 |
| 4 | Americano | 2025 | Mar | 25.96 | 2647.92 | 68740.00 |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | Cocoa | 2024 | Jun | 37.72 | 150.88 | 5691.19 |
| 100 | Espresso | 2025 | Jan | 21.06 | 105.30 | 2217.62 |
| 101 | Espresso | 2024 | Apr | 24.00 | 96.00 | 2304.00 |
| 102 | Espresso | 2024 | Nov | 21.06 | 63.18 | 1330.57 |
| 103 | Cortado | 2025 | Mar | 25.96 | 51.92 | 1347.84 |
104 rows × 6 columns
In [62]:
filtered=(
df_total
.groupby('coffee_name')['sales']
.sum()
.sort_values(ascending=False)
.head(3)
.index.tolist()
)
filtered
Out[62]:
['Latte', 'Americano with Milk', 'Cappuccino']
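The `sort_values(...).head(3)` chain above can be condensed with `nlargest`, which returns the top-n values directly. A sketch on toy data:

```python
import pandas as pd

toy_sales = pd.DataFrame({
    'coffee_name': ['Latte', 'Americano', 'Cappuccino', 'Espresso'],
    'sales': [150.0, 90.0, 120.0, 30.0],
})

# nlargest(3) replaces sort_values(ascending=False).head(3)
top3 = (
    toy_sales.groupby('coffee_name')['sales']
    .sum()
    .nlargest(3)
    .index.tolist()
)
```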
In [63]:
df_top3=df_total.query("coffee_name in @filtered")
df_top3.sample(7)
Out[63]:
| coffee_name | year | Month_name | avg_price | total_units_sold | sales | |
|---|---|---|---|---|---|---|
| 25 | Americano with Milk | 2025 | Jan | 30.86 | 1604.72 | 49521.66 |
| 0 | Latte | 2024 | Oct | 35.76 | 4291.20 | 153453.31 |
| 22 | Americano with Milk | 2024 | Dec | 30.86 | 1759.02 | 54283.36 |
| 18 | Americano with Milk | 2024 | Jul | 28.67 | 1863.80 | 53442.31 |
| 27 | Americano with Milk | 2025 | Mar | 30.86 | 1604.72 | 49521.66 |
| 24 | Latte | 2024 | Dec | 35.76 | 1680.72 | 60102.55 |
| 1 | Latte | 2024 | Sep | 33.13 | 3114.48 | 103191.34 |
In [64]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
In [65]:
fig_units=(
px.bar(
df_top3,
x="Month_name",
y="total_units_sold",
color='coffee_name',
barmode='group',
facet_col='year',
category_orders={
'Month_name': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
'year':[2024,2025]}
)
.update_layout(
title='Monthly Trends in Units Sold of TOP 3 Products (March 2024 – March 2025)',
title_x=0.5,
xaxis_title=' ',
yaxis_title=' '
)
)
fig_units.show()
- The graph shows the monthly number of coffees sold between March 2024 and March 2025. Sales gradually increase until autumn, peaking between September and November, before declining slightly over the winter months.
In [67]:
fig_sales=(
px.bar(
df_top3,
x="Month_name",
y="sales",
color='coffee_name',
barmode='group',
facet_col='year',
category_orders={
'Month_name': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
'year':[2024,2025]}
)
.update_layout(
title='Monthly Sales Performance of TOP 3 Products (March 2024 – March 2025)',
title_x=0.5,
xaxis_title=' ',
yaxis_title=' '
)
)
fig_sales.show()
- This graph illustrates the café's total sales over the same period. The trend closely follows units sold, indicating a strong positive correlation between quantity sold and overall revenue.
Coffee product performance from March 2024 to March 2025¶
The analysis reveals that:
- The coffee shop's best season runs from September to November
- Q1 2025 outperformed Q1 2024
- Higher sales volumes correspond to higher total revenue
In [70]:
sns.relplot(
data=df_total,
x='year',y ='avg_price',
kind="line",
hue="coffee_name",
marker="o",
height=3, aspect=1.5
)
plt.title("Coffee Price Trends (March 2024 - March 2025)")
plt.xlabel('')
plt.ylabel('')
plt.show()
The graph shows that prices remained stable throughout the period, with no noticeable increase or decrease.
Time-Based Sales and Traffic Analysis¶
In [73]:
df.head()
Out[73]:
| hour_of_day | cash_type | money | coffee_name | Time_of_Day | Weekday | Month_name | Weekdaysort | Monthsort | Date | Time | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | card | 38.7 | Latte | Morning | Fri | Mar | 5 | 3 | 2024-03-01 | 2025-10-14 10:15:50.520 | 2024 |
| 1 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 2025-10-14 12:19:22.539 | 2024 |
| 2 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 2025-10-14 12:20:18.089 | 2024 |
| 3 | 13 | card | 28.9 | Americano | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 2025-10-14 13:46:33.006 | 2024 |
| 4 | 13 | card | 38.7 | Latte | Afternoon | Fri | Mar | 5 | 3 | 2024-03-01 | 2025-10-14 13:48:14.626 | 2024 |
In [74]:
Weekday_order=['Mon','Tue','Wed', 'Thu','Fri', 'Sat', 'Sun']
df['Weekday']=pd.Categorical(df['Weekday'],categories=Weekday_order,ordered=True)
In [75]:
Time_of_day_order=['Morning','Afternoon','Night']
df['Time_of_Day']=pd.Categorical(df['Time_of_Day'],categories=Time_of_day_order,ordered=True)
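The effect of an ordered `Categorical` on sorting can be seen on a small example: without it, string values sort alphabetically rather than in their natural order.

```python
import pandas as pd

order = ['Morning', 'Afternoon', 'Night']
s = pd.Series(['Night', 'Morning', 'Afternoon'])

# Without the categorical, sort_values() would give the alphabetical
# order Afternoon, Morning, Night
cat = pd.Categorical(s, categories=order, ordered=True)
sorted_vals = pd.Series(cat).sort_values().tolist()
```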
In [76]:
pivot_days=(
df.pivot_table(
index=["Weekday",'Time_of_Day'],
aggfunc={'money':['sum']})
.droplevel(1, axis='columns')
.rename({ 'sum': 'total_units_sold'}, axis='columns')
.unstack()
.droplevel(0, axis='columns')
)
pivot_days
FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
Out[76]:
| Time_of_Day | Morning | Afternoon | Night |
|---|---|---|---|
| Weekday | |||
| Mon | 5987.34 | 5664.10 | 5711.66 |
| Tue | 6343.92 | 5012.08 | 6812.38 |
| Wed | 5071.32 | 5147.76 | 5531.38 |
| Thu | 4419.32 | 5203.58 | 6468.50 |
| Fri | 5839.36 | 5504.90 | 5458.40 |
| Sat | 4570.62 | 6307.30 | 3855.60 |
| Sun | 3697.32 | 5290.32 | 4348.42 |
In [77]:
pivot_days_total=pivot_days.copy()
pivot_days_total['Total']=pivot_days.sum(axis=1)
pivot_days_total
Out[77]:
| Time_of_Day | Morning | Afternoon | Night | Total |
|---|---|---|---|---|
| Weekday | ||||
| Mon | 5987.34 | 5664.10 | 5711.66 | 17363.10 |
| Tue | 6343.92 | 5012.08 | 6812.38 | 18168.38 |
| Wed | 5071.32 | 5147.76 | 5531.38 | 15750.46 |
| Thu | 4419.32 | 5203.58 | 6468.50 | 16091.40 |
| Fri | 5839.36 | 5504.90 | 5458.40 | 16802.66 |
| Sat | 4570.62 | 6307.30 | 3855.60 | 14733.52 |
| Sun | 3697.32 | 5290.32 | 4348.42 | 13336.06 |
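`pivot_table` can also compute the totals itself via `margins=True`, instead of adding a `Total` column by hand. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Weekday': ['Mon', 'Mon', 'Tue'],
    'Time_of_Day': ['Morning', 'Night', 'Morning'],
    'money': [10.0, 20.0, 5.0],
})

# margins=True appends a totals row and column, labeled via margins_name
pivot = toy.pivot_table(
    values='money',
    index='Weekday',
    columns='Time_of_Day',
    aggfunc='sum',
    margins=True,
    margins_name='Total',
)
```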
In [78]:
plt.figure(figsize=(10,6))
sns.heatmap(pivot_days, annot=True, fmt='.0f',cmap='YlGnBu')
plt.title( 'Product Sales Heatmap by Time of Day')
plt.xlabel(' ')
plt.ylabel(' ')
plt.show()
- The heatmap shows the café's sales distribution across days of the week and times of day. Weekdays clearly outperform weekends, with the strongest periods on weekday mornings and nights (Tuesday is the busiest day overall), while Saturday and Sunday peak in the afternoon.
In [80]:
df_coffee=(
df.pivot_table(
index=["coffee_name","Time_of_Day"],
aggfunc={'money':['sum']})
.droplevel(1, axis='columns')
.rename({'sum':'total_units_sold'}, axis='columns')
.unstack()
.droplevel(0,axis='columns')
)
df_coffee
Out[80]:
| Time_of_Day | Morning | Afternoon | Night |
|---|---|---|---|
| coffee_name | |||
| Americano | 5643.10 | 6133.94 | 2873.22 |
| Americano with Milk | 10025.52 | 7384.36 | 7341.24 |
| Cappuccino | 4327.44 | 5910.70 | 7201.00 |
| Cocoa | 2059.38 | 2685.92 | 3775.86 |
| Cortado | 3605.46 | 2315.84 | 1463.56 |
| Espresso | 873.72 | 1189.16 | 627.40 |
| Hot Chocolate | 1744.40 | 2899.02 | 5290.04 |
| Latte | 7650.18 | 9611.10 | 9614.02 |
In [81]:
pivot_coffee=df.pivot_table(values='money',
index='coffee_name',
columns='Time_of_Day',
aggfunc='sum')
pivot_coffee['Total']=pivot_coffee.sum(axis=1)
pivot_coffee
Out[81]:
| Time_of_Day | Morning | Afternoon | Night | Total |
|---|---|---|---|---|
| coffee_name | ||||
| Americano | 5643.10 | 6133.94 | 2873.22 | 14650.26 |
| Americano with Milk | 10025.52 | 7384.36 | 7341.24 | 24751.12 |
| Cappuccino | 4327.44 | 5910.70 | 7201.00 | 17439.14 |
| Cocoa | 2059.38 | 2685.92 | 3775.86 | 8521.16 |
| Cortado | 3605.46 | 2315.84 | 1463.56 | 7384.86 |
| Espresso | 873.72 | 1189.16 | 627.40 | 2690.28 |
| Hot Chocolate | 1744.40 | 2899.02 | 5290.04 | 9933.46 |
| Latte | 7650.18 | 9611.10 | 9614.02 | 26875.30 |
In [82]:
plt.figure(figsize=(10,6))
sns.heatmap(df_coffee, annot=True, fmt='.0f',cmap='YlGnBu')
plt.title( ' Coffee Sales by Time of Day')
plt.xlabel('')
plt.ylabel('')
plt.show()
- The heatmap illustrates when each coffee type sells most during the day.
Time-based coffee sales overview¶
The analysis of sales performance by day and time of day reveals that:
- In the mornings, customers prefer lattes and americanos with milk
- Cappuccinos perform better in the afternoon
- Espressos have the lowest performance throughout the day
In [85]:
df_hour=(
df
.pivot_table(index='hour_of_day', aggfunc={'money': 'sum'})
.sort_values(by='hour_of_day',ascending=True)
.reset_index()
)
df_hour.head()
Out[85]:
| hour_of_day | money | |
|---|---|---|
| 0 | 6 | 149.40 |
| 1 | 7 | 2846.02 |
| 2 | 8 | 7017.88 |
| 3 | 9 | 7264.28 |
| 4 | 10 | 10198.52 |
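The busiest hour can also be read off programmatically with `idxmax`. A sketch on the hourly totals shown above:

```python
import pandas as pd

# Hourly revenue totals (values taken from df_hour.head() above)
hourly = pd.Series(
    {6: 149.40, 7: 2846.02, 8: 7017.88, 9: 7264.28, 10: 10198.52}
)

# idxmax returns the index label (here: the hour) of the largest value
peak_hour = hourly.idxmax()
```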
In [86]:
fig_hour =(
px.line(df_hour, x="hour_of_day", y="money")
.update_layout(
title='Hourly Coffee Sales',
title_x=0.5,
xaxis_title='Hour of Day',
yaxis_title='',
template = 'plotly_white'
)
)
fig_hour.show()
⏰ Peak Hours Analysis¶
The analysis of hourly sales reveals that:
- The busiest hours are 8:00–10:00 AM, which corresponds to the morning coffee rush.
- A smaller peak is also visible around 3:00–4:00 PM, likely corresponding to afternoon breaks.
- Sales drop significantly after 6:00 PM, indicating low evening demand.
🧠 Conclusions & Insights¶
- ☕ Latte was the most popular product across all years.
- 💡 Coffee prices stayed stable.
- 📈 Sales increased in Q1 2025 despite stable prices, suggesting higher customer volume.
- ⏰ The busiest hours were between 8:00 and 10:00 AM.
- 🗓️ Weekdays showed higher activity than weekends, consistent with office-hour coffee demand.
This information can be used to optimize staffing schedules or run promotional campaigns during off-peak hours.
In [89]:
from IPython.display import HTML
HTML(
'''
<script>
code_show = true;
function code_toggle() {
if (code_show) {
document.querySelectorAll('.jp-CodeCell .jp-InputArea').forEach(function(el) {
el.style.display = 'none';
});
} else {
document.querySelectorAll('.jp-CodeCell .jp-InputArea').forEach(function(el) {
el.style.display = 'block';
});
}
code_show = !code_show
}
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Show / Hide code">
</form>
'''
)
Out[89]:
Analysis of factors impacting sales performance (Decision Tree Model)¶
Objective:¶
- Identify the features with the biggest impact on units sold in the coffee shop
Model used: DecisionTreeRegressor¶
In [93]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
In [126]:
df.sample(10)
Out[126]:
| hour_of_day | cash_type | money | coffee_name | Time_of_Day | Weekday | Month_name | Weekdaysort | Monthsort | Date | Time | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 841 | 19 | card | 27.92 | Americano with Milk | Night | Sat | Jul | 6 | 7 | 2024-07-06 | 2025-10-14 19:10:39.580 | 2024 |
| 1453 | 12 | card | 23.02 | Cortado | Afternoon | Thu | Sep | 4 | 9 | 2024-09-12 | 2025-10-14 12:32:22.981 | 2024 |
| 1207 | 21 | card | 27.92 | Americano with Milk | Night | Thu | Aug | 4 | 8 | 2024-08-15 | 2025-10-14 21:36:11.033 | 2024 |
| 3340 | 15 | card | 35.76 | Cappuccino | Afternoon | Sun | Mar | 7 | 3 | 2025-03-09 | 2025-10-14 15:12:41.404 | 2025 |
| 1047 | 11 | card | 27.92 | Americano with Milk | Morning | Thu | Aug | 4 | 8 | 2024-08-01 | 2025-10-14 11:22:46.369 | 2024 |
| 1448 | 10 | card | 18.12 | Espresso | Morning | Thu | Sep | 4 | 9 | 2024-09-12 | 2025-10-14 10:57:25.619 | 2024 |
| 1537 | 22 | card | 27.92 | Americano with Milk | Night | Sat | Sep | 6 | 9 | 2024-09-21 | 2025-10-14 22:18:46.088 | 2024 |
| 2683 | 7 | card | 35.76 | Latte | Morning | Tue | Jan | 2 | 1 | 2025-01-14 | 2025-10-14 07:37:42.461 | 2025 |
| 3494 | 12 | card | 25.96 | Americano | Afternoon | Thu | Mar | 4 | 3 | 2025-03-20 | 2025-10-14 12:48:24.389 | 2025 |
| 3008 | 17 | card | 30.86 | Americano with Milk | Night | Fri | Feb | 5 | 2 | 2025-02-14 | 2025-10-14 17:25:24.871 | 2025 |
In [182]:
df_2=(
df.pivot_table(
index=['Month_name','Weekday',"coffee_name",'Time_of_Day',],
aggfunc={'money':['mean','sum']})
.droplevel(0, axis='columns')
.rename({'mean': 'avg_price', 'sum': 'total_units_sold'}, axis='columns')
)
df_2.sample(10)
Out[182]:
| avg_price | total_units_sold | ||||
|---|---|---|---|---|---|
| Month_name | Weekday | coffee_name | Time_of_Day | ||
| Dec | Tue | Hot Chocolate | Night | 35.76 | 143.04 |
| Sun | Hot Chocolate | Afternoon | NaN | 0.00 | |
| Feb | Fri | Americano with Milk | Morning | 30.86 | 185.16 |
| Mon | Cocoa | Afternoon | 35.76 | 214.56 | |
| Apr | Thu | Cortado | Night | NaN | 0.00 |
| Feb | Sat | Americano with Milk | Morning | 30.86 | 92.58 |
| Jun | Fri | Hot Chocolate | Morning | NaN | 0.00 |
| Nov | Wed | Latte | Afternoon | 35.76 | 107.28 |
| Jul | Thu | Espresso | Afternoon | NaN | 0.00 |
| Dec | Sat | Cortado | Morning | 25.96 | 51.92 |
In [186]:
df_2=df_2.reset_index()
df_2
Out[186]:
| index | Month_name | Weekday | coffee_name | Time_of_Day | avg_price | total_units_sold | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Apr | Mon | Americano | Morning | 28.655 | 114.62 |
| 1 | 1 | Apr | Mon | Americano | Afternoon | 27.920 | 27.92 |
| 2 | 2 | Apr | Mon | Americano | Night | 28.900 | 57.80 |
| 3 | 3 | Apr | Mon | Americano with Milk | Morning | 33.604 | 168.02 |
| 4 | 4 | Apr | Mon | Americano with Milk | Afternoon | 32.820 | 32.82 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2011 | 2011 | Sep | Sun | Hot Chocolate | Afternoon | NaN | 0.00 |
| 2012 | 2012 | Sep | Sun | Hot Chocolate | Night | 32.820 | 65.64 |
| 2013 | 2013 | Sep | Sun | Latte | Morning | 32.820 | 32.82 |
| 2014 | 2014 | Sep | Sun | Latte | Afternoon | 34.584 | 172.92 |
| 2015 | 2015 | Sep | Sun | Latte | Night | 33.408 | 167.04 |
2016 rows × 7 columns
In [192]:
df_2 = df_2.drop(columns=['index'])  # assign back, otherwise the drop is discarded
df_2
Out[192]:
| Month_name | Weekday | coffee_name | Time_of_Day | avg_price | total_units_sold | |
|---|---|---|---|---|---|---|
| 0 | Apr | Mon | Americano | Morning | 28.655 | 114.62 |
| 1 | Apr | Mon | Americano | Afternoon | 27.920 | 27.92 |
| 2 | Apr | Mon | Americano | Night | 28.900 | 57.80 |
| 3 | Apr | Mon | Americano with Milk | Morning | 33.604 | 168.02 |
| 4 | Apr | Mon | Americano with Milk | Afternoon | 32.820 | 32.82 |
| ... | ... | ... | ... | ... | ... | ... |
| 2011 | Sep | Sun | Hot Chocolate | Afternoon | NaN | 0.00 |
| 2012 | Sep | Sun | Hot Chocolate | Night | 32.820 | 65.64 |
| 2013 | Sep | Sun | Latte | Morning | 32.820 | 32.82 |
| 2014 | Sep | Sun | Latte | Afternoon | 34.584 | 172.92 |
| 2015 | Sep | Sun | Latte | Night | 33.408 | 167.04 |
2016 rows × 6 columns
In [194]:
y=df_2['total_units_sold']
coffee_features=['Month_name','Weekday','Time_of_Day','coffee_name','avg_price']
In [196]:
X=df_2[coffee_features]
In [198]:
X.head()
Out[198]:
| Month_name | Weekday | Time_of_Day | coffee_name | avg_price | |
|---|---|---|---|---|---|
| 0 | Apr | Mon | Morning | Americano | 28.655 |
| 1 | Apr | Mon | Afternoon | Americano | 27.920 |
| 2 | Apr | Mon | Night | Americano | 28.900 |
| 3 | Apr | Mon | Morning | Americano with Milk | 33.604 |
| 4 | Apr | Mon | Afternoon | Americano with Milk | 32.820 |
In [214]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Month_name   2016 non-null   object
 1   Weekday      2016 non-null   category
 2   Time_of_Day  2016 non-null   category
 3   coffee_name  2016 non-null   object
 4   avg_price    1311 non-null   float64
dtypes: category(2), float64(1), object(2)
memory usage: 51.8+ KB
In [220]:
string_cols=X.select_dtypes(exclude='float').columns
string_cols
Out[220]:
Index(['Month_name', 'Weekday', 'Time_of_Day', 'coffee_name'], dtype='object')
In [236]:
X_final=pd.get_dummies(X,columns=string_cols)
X_final
Out[236]:
| avg_price | Month_name_Apr | Month_name_Aug | Month_name_Dec | Month_name_Feb | Month_name_Jan | Month_name_Jul | Month_name_Jun | Month_name_Mar | Month_name_May | ... | Time_of_Day_Afternoon | Time_of_Day_Night | coffee_name_Americano | coffee_name_Americano with Milk | coffee_name_Cappuccino | coffee_name_Cocoa | coffee_name_Cortado | coffee_name_Espresso | coffee_name_Hot Chocolate | coffee_name_Latte | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.655 | True | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
| 1 | 27.920 | True | False | False | False | False | False | False | False | False | ... | True | False | True | False | False | False | False | False | False | False |
| 2 | 28.900 | True | False | False | False | False | False | False | False | False | ... | False | True | True | False | False | False | False | False | False | False |
| 3 | 33.604 | True | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 4 | 32.820 | True | False | False | False | False | False | False | False | False | ... | True | False | False | True | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2011 | NaN | False | False | False | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | True | False |
| 2012 | 32.820 | False | False | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | False | True | False |
| 2013 | 32.820 | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 2014 | 34.584 | False | False | False | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | False | True |
| 2015 | 33.408 | False | False | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | False | False | True |
2016 rows × 31 columns
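Note that `avg_price` still contains NaN for the zero-sales combinations; depending on the scikit-learn version, `DecisionTreeRegressor` may reject NaN inputs, so filling them before fitting is safer. A minimal pandas-only sketch (the 0.0 fill value is an assumption, chosen to mirror the zero-sales rows):

```python
import pandas as pd

# Toy rows mirroring df_2: a missing avg_price marks a zero-sales combination
demo = pd.DataFrame({
    'avg_price': [28.655, None, 32.82],
    'total_units_sold': [114.62, 0.0, 65.64],
})

# Fill missing prices before model fitting; 0.0 mirrors the zero-sales rows,
# though a per-coffee mean would be a reasonable alternative
demo['avg_price'] = demo['avg_price'].fillna(0.0)
```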
In [252]:
train_X, val_X, train_y, val_y = train_test_split(X_final, y, random_state = 0)
In [248]:
print(X_final.index)
RangeIndex(start=0, stop=2016, step=1)
In [246]:
print(y.index)
RangeIndex(start=0, stop=2016, step=1)
In [420]:
model=DecisionTreeRegressor(max_depth=7,min_samples_split=10,random_state=1)
model.fit(train_X,train_y)
Out[420]:
DecisionTreeRegressor(max_depth=7, min_samples_split=10, random_state=1)
In [422]:
predictions=model.predict(val_X)
print(predictions[:10])
print(val_y.head(10).values)
[ 51.8640678  120.85714286   0.           0.         146.58         0.
  51.8640678   30.08571429  91.36468085  51.8640678 ]
[ 27.92 113.16   0.     0.    83.76   0.    51.92  42.12  35.76  25.96]
In [334]:
from sklearn.metrics import mean_absolute_error, r2_score
In [424]:
# Note: the scikit-learn convention is metric(y_true, y_pred); MAE is symmetric,
# but r2_score is not, so swapped arguments yield a different R² value
mae = mean_absolute_error(predictions, val_y)
r2 = r2_score(predictions, val_y)
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.3f}")
MAE: 29.20
R2: 0.251
In [426]:
importance=pd.DataFrame({
'Feature':train_X.columns,
'Importance':model.feature_importances_}).sort_values(by='Importance',ascending=False)
print(importance.head(10))
                            Feature  Importance
0                         avg_price    0.654262
23            coffee_name_Americano    0.057551
30                coffee_name_Latte    0.038540
24  coffee_name_Americano with Milk    0.034684
8                    Month_name_Mar    0.032512
22                Time_of_Day_Night    0.029497
4                    Month_name_Feb    0.025779
13                      Weekday_Mon    0.020002
11                   Month_name_Oct    0.017722
18                      Weekday_Sat    0.017041
- Price, coffee type, and month have the highest impact on sales.
In [406]:
y_train_pred=model.predict(train_X)
y_test_pred=model.predict(val_X)
In [428]:
plt.figure(figsize=(8,6))
plt.scatter(train_y,y_train_pred,label='Train')
plt.scatter(val_y,y_test_pred,label='Test')
plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
plt.xlabel('Actual sales')
plt.ylabel('Predicted sales')
plt.legend()
plt.show()